A Methodology for Spark Parameter Tuning

Authors

  • Anastasios Gounaris
  • Jordi Torres
Abstract

Spark has been established as an attractive platform for big data analysis, since it manages to hide most of the complexities related to parallelism, fault tolerance and cluster setting from developers. However, this comes at the expense of having over 150 configurable parameters, the impact of which cannot be exhaustively examined because the number of their combinations grows exponentially. The default values allow developers to deploy their applications quickly, but leave open the question of whether performance can be improved. In this work, we investigate the impact of the most important tunable Spark parameters with regard to shuffling, compression and serialization on application performance through extensive experimentation using the Spark-enabled Marenostrum III (MN3) computing infrastructure of the Barcelona Supercomputing Center. The overarching aim is to guide developers on how to change the default values. We build upon our previous work, where we mapped our experience to a trial-and-error iterative improvement methodology for tuning parameters in arbitrary applications based on evidence from a very small number of experimental runs. The main contribution of this work is an alternative systematic methodology for parameter tuning, which can be easily applied to any computing infrastructure and is shown to yield comparable, if not better, results than the initial one when applied to MN3; observed speedups in our validation case studies start from 20%. In addition, the new methodology can rely on runs using samples instead of runs on the complete datasets, which renders it significantly more practical.
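To make the idea of iterative, one-change-at-a-time tuning concrete, the following is a minimal Python sketch. It is not the authors' methodology, which the abstract does not spell out; it only illustrates a generic greedy loop over a few shuffle, compression and serialization settings. The property names are real Spark configuration keys, but the candidate values, their order, the 5% acceptance threshold and the `fake_measure` stand-in for an actual timed run are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, str]

# Candidate changes to try one at a time. The keys are real Spark properties;
# the values and their order are illustrative assumptions, not the paper's.
CANDIDATES: List[Tuple[str, str]] = [
    ("spark.serializer", "org.apache.spark.serializer.KryoSerializer"),
    ("spark.shuffle.compress", "false"),
    ("spark.io.compression.codec", "lzf"),
    ("spark.shuffle.file.buffer", "64k"),
]


def to_submit_args(config: Config) -> List[str]:
    """Render a configuration as `--conf key=value` arguments for spark-submit."""
    args: List[str] = []
    for key in sorted(config):
        args += ["--conf", f"{key}={config[key]}"]
    return args


def greedy_tune(
    measure: Callable[[Config], float],
    candidates: List[Tuple[str, str]],
    min_gain: float = 0.05,  # assumed threshold: keep a change only if >5% faster
) -> Tuple[Config, float]:
    """Try each candidate on top of the best configuration found so far."""
    best: Config = {}              # empty dict means "all Spark defaults"
    best_time = measure(best)      # baseline run with default values
    for key, value in candidates:
        trial = dict(best)
        trial[key] = value
        elapsed = measure(trial)
        if elapsed < best_time * (1.0 - min_gain):
            best, best_time = trial, elapsed
    return best, best_time


def fake_measure(config: Config) -> float:
    """Hypothetical stand-in for timing a real job (or a run on a data sample)."""
    seconds = 100.0
    if config.get("spark.serializer", "").endswith("KryoSerializer"):
        seconds -= 20.0
    if config.get("spark.shuffle.compress") == "false":
        seconds += 10.0
    return seconds


if __name__ == "__main__":
    tuned, runtime = greedy_tune(fake_measure, CANDIDATES)
    print(runtime, to_submit_args(tuned))
```

In practice `measure` would launch the application (possibly on a sample of the data, as the abstract suggests is feasible) and return its wall-clock time; the number of runs is then one baseline plus one per candidate.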

Similar resources

Spark Parameter Tuning via Trial-and-Error

Spark has been established as an attractive platform for big data analysis, since it manages to hide most of the complexities related to parallelism, fault tolerance and cluster setting from developers. However, this comes at the expense of having over 150 configurable parameters, the impact of which cannot be exhaustively examined due to the exponential amount of their combinations. The defaul...

Adaptive Tuning of Model Predictive Control Parameters based on Analytical Results

In dealing with model predictive controllers (MPC), controller tuning is a key design step. Various tuning methods are proposed in the literature which can be categorized as heuristic, numerical and analytical methods. Among the available tuning methods, analytical approaches are more interesting and useful. This paper is based on a proposed analytical MPC tuning approach for plants can be appr...

Mathematical Modeling and Analysis of Spark Erosion Machining Parameters of Hastelloy C-276 Using Multiple Regression Analysis (RESEARCH NOTE)

Electrical discharge machining has the capability of machining complicated shapes in electrically conductive materials independent of hardness of the work materials. This present article details the development of multiple regression models for envisaging the material removal rate and roughness of machined surface in electrical discharge machining of Hastelloy C276. The experimental runs are de...

Air-Fuel Ratio Control of a Lean Burn SI Engine Using Fuzzy Self Tuning Method

Reducing the exhaust emissions of a spark ignition engine by means of engine modifications requires consideration of the effects of these modifications on the variations of crankshaft torque and on the engine roughness. Only if the roughness does not exceed a certain level does the vehicle not begin to surge. This paper presents a method for controlling the air-fuel ratio for a lean bu...

Efficient and Robust Parameter Tuning for Heuristic Algorithms

The main advantage of heuristic or metaheuristic algorithms compared to exact optimization methods is their ability to handle large-scale instances within a reasonable time, albeit at the expense of losing the guarantee of reaching the optimal solution. Therefore, metaheuristic techniques are appropriate choices for solving NP-hard problems to near optimality. Since the parameters of heuristi...

Journal:
  • Big Data Research

Volume: 11

Publication date: 2018